[V1][Core] Add a cache hit threshold for requests #24520
Conversation
Code Review
This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.
I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.
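For context, a minimal sketch of the kind of guard the review is asking for, assuming the scheduler derives the ratio from computed and prompt token counts (function and argument names here are illustrative, not the PR's actual code):

```python
def compute_hit_ratio(num_computed_tokens: int, num_prompt_tokens: int) -> float:
    # An empty prompt would make the division below raise ZeroDivisionError,
    # so treat it as a full hit: there is no prefill work left to gate.
    if num_prompt_tokens == 0:
        return 1.0
    return num_computed_tokens / num_prompt_tokens
```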
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run `fastcheck` CI; you can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
@robertgshaw2-redhat self tag
This pull request has merge conflicts that must be resolved before it can be merged.
(also added to the PR description above) An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d PD tests and involved PD requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including over all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router (such as llm-d / Dynamo / Production Stack) orchestrating PD has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) will reject this prefill work in case of preemption, and the request will be sent back to the calling router / sidecar / worker.
This pull request has merge conflicts that must be resolved before it can be merged.
xref #26813 - a proposal to add a policy that, if a request fails because remote KV can't be loaded, we just abort the request rather than falling back to doing the prefill work in the decode instance.
Good catch, @markmc. I'll comment there - we can possibly join forces. Are you reviewing this PR as well?

@kfirwolfson looks very good to me! Thanks!
Squashed commit message:

- Fix Gemini CR comments
- Add unit tests
- Move from SamplingParams to request
- unit test remake
- fix static code analysis rejects
- Fix unit test
- fix after local CR
- fix pre-commit reject
- add threshold to request logger and fix some calls to encode
- fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
[V1][Core] Add a cache hit threshold for requests
Purpose
Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests that are unlikely to benefit from prefill in P/D disaggregated deployments.
Edit: an additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d PD tests and involved PD requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including over all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router (such as llm-d / Dynamo / Production Stack) orchestrating PD has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) will reject this prefill work in case of preemption, and the request will be sent back to the calling router / sidecar / worker.
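As a hedged illustration of that flow, a router sidecar might set a tiny threshold when dispatching to a Decode instance and re-route on rejection. The endpoint URL and the `reroute_via_prefill` helper below are hypothetical; only the `cache_hit_threshold` field and the `cache_threshold` finish reason come from this PR.

```python
import requests

DECODE_URL = "http://decode-instance:8000/v1/completions"  # hypothetical endpoint


def reroute_via_prefill(payload: dict) -> dict:
    # Placeholder: a real router would resubmit the request through a
    # Prefill instance (or its normal P/D scheduling path).
    raise NotImplementedError


def send_to_decode(payload: dict) -> dict:
    # A tiny threshold means: only run on this Decode instance if the
    # prompt is already (almost entirely) present in its KV-cache.
    payload["cache_hit_threshold"] = 0.001
    body = requests.post(DECODE_URL, json=payload).json()
    if body["choices"][0]["finish_reason"] == "cache_threshold":
        # The decoder lost its cached blocks (e.g. after preemption);
        # send the request back through the prefill path instead of
        # letting the decoder redo the prefill itself.
        return reroute_via_prefill(payload)
    return body
```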
What this PR adds
- `--global-cache-hit-threshold` ([0.0–1.0], default `0.0`).
- `cache_hit_threshold` ([0.0–1.0]) in the incoming request `ChatCompletionRequest` / `CompletionRequest` (validated in the protocol layer).
- A new finish reason, `"cache_threshold"`, exposed via the v1 engine API. Requests rejected by this gating return HTTP 200 with `finish_reason="cache_threshold"` and no output tokens.
- Plumbing through `VllmConfig` and `SchedulerConfig`.
- Validation that threshold values lie in `[0.0, 1.0]`.
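For illustration, here is how the per-request field might be passed through the OpenAI-compatible API using the `openai` client's `extra_body` escape hatch; the model name and prompt are placeholders, not from the PR:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="placeholder-model",  # whatever model the server is serving
    prompt="Summarize the following document ...",
    max_tokens=64,
    extra_body={"cache_hit_threshold": 0.33},  # per-request override
)

# A gated request still returns HTTP 200, just with no output tokens.
if completion.choices[0].finish_reason == "cache_threshold":
    print("Rejected: KV-cache hit rate below threshold")
```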
Why

Backwards compatibility
Default `0.0` → the feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan
1) Unit Tests
Unit tests check the scheduler logic.
2) E2E manual tests
Run `vllm serve` with the `--global-cache-hit-threshold 0.8` argument to set a default value; we'll override it in most requests. The scheduler computes `hit_ratio = computed_tokens / prompt_tokens` and compares it against the effective threshold (see the sketch after the list below). We will send 4 requests. Note that the order of sending them matters, as the first request fills the cache the others depend on:
- `cache_hit_threshold: 0`, so it's guaranteed to execute and populate the KV-cache.
- `cache_hit_threshold: 0.33`.
- No `cache_hit_threshold` field, which means the global value of `0.8` will take effect.
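A sketch of the gating decision these requests exercise, assuming the per-request value takes precedence over the global default (function and argument names are illustrative, not the PR's actual code):

```python
def should_reject(num_computed_tokens: int,
                  num_prompt_tokens: int,
                  request_threshold: float | None,
                  global_threshold: float) -> bool:
    """Return True if the request should be gated (rejected)."""
    # The per-request threshold takes precedence; otherwise fall back
    # to the server-wide --global-cache-hit-threshold default.
    threshold = request_threshold if request_threshold is not None else global_threshold
    if threshold <= 0.0 or num_prompt_tokens == 0:
        return False  # gating disabled, or nothing to prefill
    hit_ratio = num_computed_tokens / num_prompt_tokens
    return hit_ratio < threshold
```

Against the numbers in the Test Result section below, request 2's ratio of 0.28 falls under its 0.33 request threshold, and request 4's 0.41 falls under the 0.80 global default.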
Request 1) Warm the cache

This run uses `cache_hit_threshold: 0`, so it's guaranteed to execute and populate the KV-cache for the base segment.
Request 2) MISS case

Expected: HTTP 200 with `"finish_reason": "cache_threshold"`.
Request 3) HIT case

Expected: normal generation (`"finish_reason"` is not `"cache_threshold"`).
Request 4) MISS case using global threshold

Use the global threshold set to `0.8`. Expected: HTTP 200 with `"finish_reason": "cache_threshold"`.

Notes
Test Result
E2E Local smoke tests on a single node:
- Rejected requests return HTTP `200` with `finish_reason: "cache_threshold"` and empty outputs. Example log lines:

```
Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
```

- Also tested with threshold values `0.0` and `1.0` (not detailed above).

Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.